Scatterplot with birth data:
library(mosaic)
library(mosaicData)
xyplot(births ~ dayofyear, data = Births78)
Other commands we ran on day 1 of the workshop, with some comments on them:
histogram(~ age, data = HELPrct) #histogram of age
favstats(~ age, data = HELPrct) #My favorite statistics; it ignores NAs
## min Q1 median Q3 max mean sd n missing
## 19 30 35 40 60 35.65342 7.710266 453 0
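For comparison, the columns that favstats reports can be approximated with base R. This is only a rough sketch on a made-up vector called `ages` (hypothetical, not the HELPrct data); it shows missing values being dropped before summarising and counted separately.

```r
# Hypothetical ages with one missing value (not the HELPrct data)
ages <- c(19, 30, 35, NA, 40, 60)

# Base R pieces that roughly correspond to the favstats() columns
summary_stats <- c(
  min     = min(ages, na.rm = TRUE),
  median  = median(ages, na.rm = TRUE),
  max     = max(ages, na.rm = TRUE),
  mean    = mean(ages, na.rm = TRUE),
  sd      = sd(ages, na.rm = TRUE),
  n       = sum(!is.na(ages)),   # cases actually used
  missing = sum(is.na(ages))     # cases dropped
)
summary_stats
```

Without `na.rm = TRUE`, each of these base functions would return NA for this vector, which is the behaviour favstats saves you from.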
tally(~sex, data = HELPrct) #Count of gender
##
## female male
## 107 346
tally(~sex, format = "percent", data = HELPrct) #Percents of gender
##
## female male
## 23.62031 76.37969
tally(~sex, format = "proportion", data = HELPrct) #Proportions of gender
##
## female male
## 0.2362031 0.7637969
tally(~substance, data = HELPrct) #Count of substance
##
## alcohol cocaine heroin
## 177 152 124
tally(~substance, format = "perc", data = HELPrct) #Percents of substance
##
## alcohol cocaine heroin
## 39.07285 33.55408 27.37307
tally(sex ~ substance, data = HELPrct) #Cross-tab sex & substance
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
tally(~ sex + substance, data = HELPrct) #Ditto, just a different format
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
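The count, proportion and percent formats of a one-variable tally are related by simple arithmetic. The sketch below rebuilds the sex variable from the counts printed above and uses base R's table and prop.table as stand-ins for tally; the vector `sex` is constructed here for illustration and is not pulled from HELPrct.

```r
# Rebuild the sex variable from the counts printed above (107 female, 346 male)
sex <- rep(c("female", "male"), times = c(107, 346))

counts      <- table(sex)            # like tally(~sex)
proportions <- prop.table(counts)    # like format = "proportion"
percents    <- 100 * proportions     # like format = "percent"

round(percents, 2)
```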
Also, always remember the mplot command for producing graphs interactively, and click ‘Show Expression’ to see the code behind the plot.
Consider cases and variables
In tidy data:
All of this goes into a codebook only.
DTK:
* Glyphs are marks. Data glyphs are also marks, but the features of the glyphs encode the values of the variables; these visual properties are the aesthetics.
* The choice we make as experts in our field is which aesthetic to map to each variable. The word ‘aesthetic’ here is taken from its early ...
* When we make data glyphs we map variables to aesthetics.
* A scale is for a computer and a guide is for a person (about the aesthetics). A legend (beside a graph) is an example of a guide.
* A data table is glyph-ready when there is one row for each glyph to be drawn (I could say, “that’s x-position, that’s y-position, that’s color, that’s size”).
* Glyph-ready data are tidy data, but tidy data are not necessarily glyph-ready.
* Sometimes glyphs represent the collective properties of variables, e.g. in the case of histograms.
RP:
require(lubridate)
data(Births78)
head(Births78, 3)
## date births dayofyear
## 1 1978-01-01 7701 1
## 2 1978-01-02 7527 2
## 3 1978-01-03 8825 3
* ggplot() + geom_point()
* aes(), e.g. aes(x = date, y = births)
ggplot(data = Births78, aes(x = date, y = births)) + geom_point()
But, we need to add days of the week because that’s more useful to us:
Births78 <-
Births78 %>%
mutate(wday = wday(date, label = TRUE))
ggplot(data = Births78, aes(x = date, y = births, color = wday)) + geom_point()
Note that the same graph would be generated by the following:
ggplot(data = Births78) + geom_point(aes(x = date, y = births, color = wday))
We could change this to a line graph:
ggplot(data = Births78) + geom_line(aes(x = date, y = births, color = wday))
Or we could have points and lines; note that we have moved the aes commands into the ggplot command because it applies to all the layers.
ggplot(data = Births78, aes(x = date, y = births, color = wday)) + geom_line() + geom_point()
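One way to convince yourself that mapping in ggplot() and mapping in the geom give the same layer is to compare the data ggplot2 actually draws. This is a small sketch, assuming only that ggplot2 is installed; it uses the built-in mtcars data as a stand-in for Births78, and the names `p_global` and `p_local` are made up for the example.

```r
library(ggplot2)

# mtcars ships with R; it stands in for Births78 here (an assumption)
p_global <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()
p_local  <- ggplot(data = mtcars) + geom_point(aes(x = wt, y = mpg))

# layer_data() returns the data ggplot2 computes for a layer
same_positions <- all.equal(
  layer_data(p_global)[, c("x", "y")],
  layer_data(p_local)[, c("x", "y")]
)
same_positions
```

The difference only matters once there are several layers: aesthetics given in ggplot() are inherited by every geom, while aesthetics given inside one geom stay local to it.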
We could have put the data outside the ggplot command using the magrittr pipe:
Births78 %>%
ggplot(aes(x = date, y = births, color = wday)) + geom_line() + geom_point()
To give every glyph the same fixed color we need setting rather than mapping: inside the individual geom you set the color you want, outside aes() (unlike ggvis, where setting is done with :=). Within the ggplot() call itself you can only map, not set.
Births78 %>%
ggplot(aes(x = date, y = births)) + geom_point(color = "navy")
Combine the colored lines with navy points - notice that wday is in the aes in the geom_line whereas with geom_point we don’t call on aes because we’re simply setting (not mapping).
Births78 %>%
ggplot(aes(x = date, y = births)) +
geom_line(aes(color = wday)) +
geom_point(color = "navy")
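The difference between setting and mapping shows up clearly if a constant is put inside aes() by mistake. The sketch below assumes ggplot2 is installed and again uses the built-in mtcars data as a stand-in; `p_set` and `p_map` are names invented for the example.

```r
library(ggplot2)

# Setting: a constant outside aes() is used as-is
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "navy")

# Mapping: a constant inside aes() is treated as data with a single level,
# so the colour scale picks its own colour and a legend is added
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "navy"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

In the mapped version the points are not navy at all: the string "navy" is just a label that the scale translates into its default first colour.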
Recall that we can list the kinds of geoms that exist with the apropos command; notice the caret (^), which restricts the match to names that start with geom:
apropos("^geom")
## [1] "geom_abline" "geom_area" "geom_bar"
## [4] "geom_bin2d" "geom_blank" "geom_boxplot"
## [7] "geom_contour" "geom_crossbar" "geom_density"
## [10] "geom_density2d" "geom_dotplot" "geom_errorbar"
## [13] "geom_errorbarh" "geom_freqpoly" "geom_hex"
## [16] "geom_histogram" "geom_hline" "geom_jitter"
## [19] "geom_line" "geom_linerange" "geom_map"
## [22] "geom_path" "geom_point" "geom_pointrange"
## [25] "geom_polygon" "geom_quantile" "geom_raster"
## [28] "geom_rect" "geom_ribbon" "geom_rug"
## [31] "geom_segment" "geom_smooth" "geom_step"
## [34] "geom_text" "geom_tile" "geom_violin"
## [37] "geom_vline"
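The effect of the caret can be seen on a plain character vector, without relying on which packages are attached. A minimal sketch in base R; the vector `candidates` is just a handful of names picked for illustration.

```r
# A few function names, chosen to show where "geom" appears
candidates <- c("geom_point", "geom_line", "update_geom_defaults", "stat_density")

starts_with_geom <- grepl("^geom", candidates)  # anchored at the start
contains_geom    <- grepl("geom", candidates)   # anywhere in the name

candidates[starts_with_geom]
candidates[contains_geom]
```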
Key plot metric: Does my plot make the comparisons I am interested in easy and accurate?
We can generate a new geom, a geom_bar:
HELPrct %>%
ggplot(aes(x = substance)) +
geom_bar()
Notice that we were able to construct the bar chart even though our data weren’t glyph-ready with counts; ggplot did it for us.
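To see what ggplot did behind the scenes, we can make the data glyph-ready ourselves and then draw the bars from the counts. This is a sketch under assumptions: the data frame `helper` is rebuilt from the substance counts tallied earlier rather than taken from HELPrct, and dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(ggplot2)

# Rebuild the substance variable from the counts tallied earlier
# (177 alcohol, 152 cocaine, 124 heroin); a stand-in for HELPrct
helper <- data.frame(
  substance = rep(c("alcohol", "cocaine", "heroin"), times = c(177, 152, 124))
)

# One row per bar: this table is glyph-ready for a bar chart
substance_counts <- helper %>%
  group_by(substance) %>%
  summarise(n = n())

# stat = "identity" tells geom_bar to use n as the bar height directly
bar_plot <- substance_counts %>%
  ggplot(aes(x = substance, y = n)) +
  geom_bar(stat = "identity")
bar_plot
```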
HELPrct %>%
ggplot(aes(x = age)) +
geom_histogram(binwidth = 2)
* We also often want to use frequency polygons or kernel density functions:
HELPrct %>%
ggplot(aes(x = age)) +
geom_freqpoly(binwidth = 2)
HELPrct %>%
ggplot(aes(x = age)) +
geom_density()
But note that RP prefers to draw the density as a line plot because it look ...
HELPrct %>%
ggplot(aes(x = age)) +
geom_line(stat = "density")
Or we could have specified the geom inside stat_density:
HELPrct %>%
ggplot(aes(x=age)) +
stat_density( geom="line")
## ymax not defined: adjusting position using y instead
Now generate your own graph looking at the average consumption of drinks (the variable i1), which I did by substance group:
HELPrct %>%
ggplot(aes(x = i1)) +
geom_line(stat = "density", aes(color = factor(substance)))
$, [], [[]],
There are 5 kinds of things (objects) in R:
* Data frames, e.g. BabyNames
* Variables, e.g. year
* Constants, e.g. "Treatment" or 42 (thanks DA)
* Functions, e.g. sd() (multiple arguments separated with commas)
* %>% takes something and puts it in as the first input to a function, e.g. something %>% function() rather than function(something, other stuff, ...).
3 Main kinds of functions for Wrangling
To run the following you may need to install the packages NHANES, dplyr, and babynames. Run the command, e.g. install.packages("NHANES"). Remember to require these packages.
Now we run some commands to start wrangling the data.
require(NHANES)
require(dplyr)
require(babynames)
NHANESmajors <- filter(NHANES, Age >= 21)
nowsmoking <- select(NHANES, Age, SmokeNow)
nowsmoking %>%
head()
## Source: local data frame [6 x 2]
##
## Age SmokeNow
## 1 34 No
## 2 34 No
## 3 34 No
## 4 4 NA
## 5 49 Yes
## 6 9 NA
group_by(babynames, name, year)
## Source: local data frame [1,792,091 x 5]
## Groups: name, year
##
## year sex name n prop
## 1 1880 F Mary 7065 0.07238359
## 2 1880 F Anna 2604 0.02667896
## 3 1880 F Emma 2003 0.02052149
## 4 1880 F Elizabeth 1939 0.01986579
## 5 1880 F Minnie 1746 0.01788843
## 6 1880 F Margaret 1578 0.01616720
## 7 1880 F Ida 1472 0.01508119
## 8 1880 F Alice 1414 0.01448696
## 9 1880 F Bertha 1320 0.01352390
## 10 1880 F Sarah 1288 0.01319605
## .. ... ... ... ... ...
head(KidsFeet)
## name birthmonth birthyear length width sex biggerfoot domhand
## 1 David 5 88 24.4 8.4 B L R
## 2 Lars 10 87 25.4 8.8 B L L
## 3 Zach 12 87 24.5 9.7 B R R
## 4 Josh 1 88 25.2 9.8 B L R
## 5 Lang 2 88 25.1 8.9 B L R
## 6 Scotty 3 88 25.7 9.7 B R R
select(KidsFeet, birthyear, domhand) %>%
head()
## birthyear domhand
## 1 88 R
## 2 87 L
## 3 87 R
## 4 88 R
## 5 88 R
## 6 88 R
righties <- filter(KidsFeet, domhand == "R")
head(righties)
## name birthmonth birthyear length width sex biggerfoot domhand
## 1 David 5 88 24.4 8.4 B L R
## 2 Zach 12 87 24.5 9.7 B R R
## 3 Josh 1 88 25.2 9.8 B L R
## 4 Lang 2 88 25.1 8.9 B L R
## 5 Scotty 3 88 25.7 9.7 B R R
## 6 Edward 2 88 26.1 9.6 B L R
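The two verbs used so far work on different dimensions of the table: filter keeps some of the cases (rows), select keeps some of the variables (columns). A minimal sketch, assuming dplyr is installed; the data frame `kids` is hypothetical and much smaller than KidsFeet.

```r
library(dplyr)

# Hypothetical mini data set (not KidsFeet itself)
kids <- data.frame(
  name    = c("Ann", "Ben", "Cal", "Dee"),
  length  = c(24.4, 25.4, 24.5, 25.2),
  domhand = c("R", "L", "R", "R")
)

right_handed <- filter(kids, domhand == "R")   # fewer rows, same columns
two_columns  <- select(kids, name, domhand)    # same rows, fewer columns

dim(right_handed)
dim(two_columns)
```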
The pipe (%>%) makes it clear what the individual lines are doing.
KidsFeetArea <- KidsFeet %>%
mutate(area = length * width) %>%
head()
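To make the comparison concrete, here is the same computation written with the pipe and as nested calls. This is a sketch assuming dplyr is installed; `feet` is a hypothetical three-row data frame rather than KidsFeet.

```r
library(dplyr)

# Hypothetical measurements (not KidsFeet itself)
feet <- data.frame(length = c(24.4, 25.4, 24.5), width = c(8.4, 8.8, 9.7))

# Piped: read top to bottom
piped <- feet %>%
  mutate(area = length * width) %>%
  head(2)

# Nested: the same calls, read from the inside out
nested <- head(mutate(feet, area = length * width), 2)

identical(piped, nested)
```

Both forms produce the same data frame; the piped form simply lists the steps in the order they happen.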
arrange orders your cases in terms of the variable you select in ascending order, e.g. arrange(area).
KidsFeet %>%
mutate(area = length * width) %>%
arrange(area) %>%
head()
## name birthmonth birthyear length width sex biggerfoot domhand area
## 1 Hayley 1 88 21.6 7.9 G R R 170.64
## 2 Kate 4 88 23.7 7.9 G R R 187.23
## 3 Caitlin 7 88 22.5 8.6 G R R 193.50
## 4 Hannah 3 88 22.9 8.5 G L R 194.65
## 5 Peggy 10 88 24.2 8.1 G L R 196.02
## 6 Laura 9 88 24.0 8.3 G R L 199.20
To arrange in descending order, you can add desc() within arrange.
KidsFeet %>%
mutate(area = length * width) %>%
arrange(desc(area)) %>%
head()
## name birthmonth birthyear length width sex biggerfoot domhand area
## 1 Mark 9 87 27.5 9.8 B R R 269.50
## 2 Cam 3 88 27.0 9.8 B L R 264.60
## 3 Glen 7 88 27.1 9.4 B L R 254.74
## 4 Edward 2 88 26.1 9.6 B L R 250.56
## 5 Scotty 3 88 25.7 9.7 B R R 249.29
## 6 Abby 2 88 26.1 9.5 G L R 247.95
We can also add another argument to arrange to break ties, e.g. adding birthmonth (I’m going to head for 10 cases):
KidsFeet %>%
mutate(area = length * width) %>%
arrange(area, desc(birthmonth)) %>%
head(10)
## name birthmonth birthyear length width sex biggerfoot domhand area
## 1 Hayley 1 88 21.6 7.9 G R R 170.64
## 2 Kate 4 88 23.7 7.9 G R R 187.23
## 3 Caitlin 7 88 22.5 8.6 G R R 193.50
## 4 Hannah 3 88 22.9 8.5 G L R 194.65
## 5 Peggy 10 88 24.2 8.1 G L R 196.02
## 6 Laura 9 88 24.0 8.3 G R L 199.20
## 7 Damon 9 88 22.9 8.8 B R L 201.52
## 8 Caitlin 6 88 23.0 8.8 G L R 202.40
## 9 David 5 88 24.4 8.4 B L R 204.96
## 10 Caroline 12 87 24.0 8.7 G R L 208.80
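The first ten areas in KidsFeet happen to be all different, so the tie-breaker never comes into play in that output. A small sketch with a deliberate tie shows what the second argument does; it assumes dplyr is installed, and the data frame `ties` is made up.

```r
library(dplyr)

# Hypothetical data with a deliberate tie in area (rows A and C)
ties <- data.frame(
  name       = c("A", "B", "C", "D"),
  area       = c(200, 180, 200, 190),
  birthmonth = c(3, 7, 11, 5)
)

# Ascending area first; within equal areas, later birth months first
sorted <- ties %>%
  arrange(area, desc(birthmonth))
sorted
```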
The verb summarise creates a single row of summary statistics; you choose each statistic by giving a function call a label, e.g. ave_area = mean(area).
NewThing <- KidsFeet %>%
mutate(area = length * width)
AnotherNewThing <- NewThing %>%
summarise(ave_area = mean(area))
head(NewThing)
## name birthmonth birthyear length width sex biggerfoot domhand area
## 1 David 5 88 24.4 8.4 B L R 204.96
## 2 Lars 10 87 25.4 8.8 B L L 223.52
## 3 Zach 12 87 24.5 9.7 B R R 237.65
## 4 Josh 1 88 25.2 9.8 B L R 246.96
## 5 Lang 2 88 25.1 8.9 B L R 223.39
## 6 Scotty 3 88 25.7 9.7 B R R 249.29
head(AnotherNewThing)
## ave_area
## 1 222.7369
The important feature of summarise is that it creates one row containing whichever summary variables you specify, e.g. the max of a variable alongside its mean:
AnotherNewThing <- NewThing %>%
summarise(ave_area = mean(area), max_area = max(area))
AnotherNewThing
## ave_area max_area
## 1 222.7369 269.5
We have looked at 5 commands so far:
* filter
* select
* mutate
* arrange
* summarise
We shall add another, more complex command, group_by.
Let us summarise foot area by sex:
NewThing %>%
group_by(sex) %>%
summarise(ave_area = mean(area), max_area = max(area))
## Source: local data frame [2 x 3]
##
## sex ave_area max_area
## 1 B 231.0140 269.50
## 2 G 214.0242 247.95
Or we could start from KidsFeet directly, rather than assigning intermediate data frames as we did with NewThing.
KidsFeet %>%
mutate(area = length * width) %>%
group_by(sex) %>%
summarise(ave_area = mean(area), max_area = max(area))
## Source: local data frame [2 x 3]
##
## sex ave_area max_area
## 1 B 231.0140 269.50
## 2 G 214.0242 247.95
Or we could group by both dominant hand and birth year:
KidsFeet %>%
mutate(area = length * width) %>%
group_by(domhand, birthyear) %>%
summarise(ave_area = mean(area), max_area = max(area))
## Source: local data frame [4 x 4]
## Groups: domhand
##
## domhand birthyear ave_area max_area
## 1 L 87 216.1600 223.52
## 2 L 88 214.6850 240.30
## 3 R 87 245.7420 269.50
## 4 R 88 220.6769 264.60
It’s important to remain aware of the change in the cases when you use group_by.
1. You have changed the cases: before group_by the cases were, e.g., the kids; after summarising, each case is a group of kids.
2. Not all the variables in the original data will appear in the summarised output; only the ones you group by and the ones you create will be there.
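Both points can be checked directly by counting rows and listing variables before and after. The following is a sketch assuming dplyr (version 1.0 or later) is installed; the data frame `kids` and its values are hypothetical, not KidsFeet.

```r
library(dplyr)

# Hypothetical kids: 5 cases before grouping
kids <- data.frame(
  domhand   = c("L", "L", "R", "R", "R"),
  birthyear = c(87, 88, 87, 88, 88),
  area      = c(216, 214, 245, 220, 222)
)

by_group <- kids %>%
  group_by(domhand, birthyear) %>%
  summarise(ave_area = mean(area))

nrow(kids)            # cases are kids
nrow(by_group)        # cases are now (domhand, birthyear) groups
names(by_group)       # only grouping variables and new summaries remain
group_vars(by_group)  # summarise peels off the last grouping variable
```

The last line corresponds to the `Groups: domhand` header in the output above: after summarising over two grouping variables, the result is still grouped by the first one.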
These are tips I picked out (thus idiosyncratic):
* If you see object '[object name]' not found, that normally means you (or your student) have misspelt the variable name or data frame name. (RP)
* ~ ‘wiggles’! Sounds funny and is less scary than tilde. (RP)
Importing the medicare data…
#medicare <- read.csv("/Users/shalliday/Google Drive/simondhalliday.github.io/cvc_workshop/medicare_fy2013.csv")
medicare <- read.csv("medicare_fy2013.csv")
#head(medicare)
#str(medicare)
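The dotted variable names used with this data (Provider.State, Average.Covered.Charges) are consistent with read.csv's default of turning column headers into syntactically valid R names. The sketch below assumes, without having seen the file, that the original headers contained spaces; the inline CSV text and its values are made up for illustration.

```r
# Made-up two-row CSV; the headers with spaces are an assumption
csv_text <- "Provider State,Average Covered Charges\nAL,100\nAK,250"

# Default: headers are passed through make.names(), so spaces become dots
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

default_names
raw_names
```

Running str(medicare) after the import is a quick way to confirm which names you actually ended up with.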
Trying to embed a badly thought-out plot using the ggvis package:
library(dplyr)
library(ggvis)
medicare %>%
ggvis(~factor(Provider.State), ~Average.Covered.Charges) %>%
layer_bars() %>%
add_axis("x", properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
))